Today, we’ll explore Russians’ support for the war in Ukraine using a public opinion survey from Russia conducted by Alexei Miniailo’s “Do Russians Want War” project.
The survey was conducted by phone using a random sample of mobile phone numbers to produce a sample of respondents representative of the population in terms of age, sex, and geography. It was in the field from February 28 to March 2.
We will look at how support for the war varies with the demographic predictors age, sex and education. We will see how multiple regression can be used to describe more complex relationships. From our baseline model, we will ask:
Does the relationship between age and support for the war vary among male and female respondents (Interaction model)
Does the relationship between age and support vary at different levels of age (Polynomial regression model)
Does the relationship between age and support vary at different levels of age and does the nature of this variation differ among male and female respondents (Polynomial regression model with an interaction term)
To accomplish this, we will do the following:
Get set up to work and describe our data (10 minutes)
Get practice interpreting long chunks of code with lots of %>%s (10 minutes)
Estimate four models of increasing complexity (15 minutes)
Present these models in a regression table and interpret the results (10 minutes)
Evaluate the relative performance of these models in terms of the variance they explain (\(R^2\)) (15 minutes)
Produce predicted values to help us interpret and compare our baseline model to a model where the “effect” of an increase in age changes with age (10 minutes)
Produce predicted values to help us interpret and compare models where the “effect” of age is allowed to vary with respondent’s sex (10 minutes)
Finally, if there’s time, we will:
One of questions 1-7 will be randomly selected as the graded question for the lab.
You will work in your assigned groups. Only one member of each group needs to submit the html file produced by knitting the lab.
This lab must contain the names of the group members in attendance.
If you are attending remotely, you will submit your labs individually.
You can find your assigned groups in previous labs.
Broadly, our goals in this assignment are to:
Get comfortable estimating and interpreting regression models with interaction terms
Including an interaction between two variables is useful if we think the “effect” of one variable depends on the value of another variable.
Today, we test whether the association between support for the war and age differs between male and female respondents.
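As a quick sketch of how an interaction term works in R, here is a toy example on made-up data (the variables x, g, and y are invented for illustration and have nothing to do with the survey):

```r
# Toy data: outcome y, numeric predictor x, binary group g
set.seed(42)
df <- data.frame(x = rnorm(100), g = rep(c("A", "B"), each = 50))
df$y <- 1 + 2 * df$x + ifelse(df$g == "B", -1.5 * df$x, 0) + rnorm(100)

# In a formula, x * g expands to x + g + x:g,
# so the slope on x is allowed to differ by group
m_int <- lm(y ~ x * g, data = df)
coef(m_int)["x:gB"]  # estimated difference in slopes between groups
```

The coefficient on the x:gB term is the estimated difference between group B’s slope and group A’s slope; here it should be close to the -1.5 we built into the toy data.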
Get comfortable estimating and interpreting regression models with polynomial terms
Polynomial terms are a way of incorporating non-linearity in our predictors.1
If we think the relationship between a predictor and an outcome varies at different levels of the predictor, we can include a polynomial term(s) for that predictor. Now our model will describe the relationship between \(x\) and \(y\) with a curve (varying slope), rather than a line (constant slope)
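Here is a minimal sketch of a polynomial term on made-up data (all numbers are invented for illustration):

```r
# Toy data: a relationship that rises and then levels off
set.seed(7)
x <- runif(200, 0, 10)
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(200)

# I() tells the formula to treat ^2 as arithmetic rather than formula syntax
m_poly <- lm(y ~ x + I(x^2))
coef(m_poly)  # the implied slope at a given x is b1 + 2 * b2 * x
```

Because the implied slope depends on x, the fitted values trace out a curve rather than a straight line.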
Learn how to generate predicted values from our model to help us interpret complex regression models
Regression tables are great for summarizing basic regression models.
Models with interaction terms or polynomial terms (or both) are difficult to interpret by looking at just the coefficients themselves.
Producing predicted values that show how the model’s predictions change as variables in an interaction or polynomial change helps us understand what these coefficients are telling us.
Practice comparing nested models using the proportion of the variance explained by each model.
Recall that a model’s \(R^2\) describes the proportion of the total variance in our outcome explained by the model’s predictions.
When we add predictors to a model, the model becomes more flexible, and can explain more variance in the outcome.
adjusted \(R^2\) adjusts a model’s \(R^2\) by penalizing the model for the total number of predictors needed to explain a given amount of variance
When models are nested (the predictors in a smaller model are a subset of the predictors in a larger model), we can compare the relative performance of the two using tools like \(R^2\) and adjusted \(R^2\).
When comparing models, we trade off our desire to explain as much variation in the outcome as possible against a general preference for more parsimonious models.2
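To see this trade-off concretely, here is a toy example (made-up data) where adding irrelevant predictors raises \(R^2\) but is penalized by adjusted \(R^2\):

```r
# Toy data: y depends only on x1; x2 and noise are irrelevant
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); noise <- rnorm(n)
y <- 1 + x1 + rnorm(n)

small <- lm(y ~ x1)
large <- lm(y ~ x1 + x2 + noise)  # extra, irrelevant predictors

summary(small)$r.squared      # R^2 of the smaller model
summary(large)$r.squared      # always at least as large: more flexibility
summary(small)$adj.r.squared
summary(large)$adj.r.squared  # penalized for the extra predictors
```

The larger model’s \(R^2\) can never be lower than the smaller model’s, but its adjusted \(R^2\) can be, which is why adjusted \(R^2\) is more useful for comparing nested models.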
As with every lab, you should:
Update the author: section of the YAML header to include the names of your group members in attendance.
First, let’s load the libraries we’ll need for today.
the_packages <- c(
## R Markdown
"kableExtra","DT","texreg","htmltools",
"flair", # Comments only
## Tidyverse
"tidyverse", "lubridate", "forcats", "haven", "labelled",
## Extensions for ggplot
"ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr",
"GGally", "scales", "dagitty", "ggdag", "ggforce",
# Data
"COVID19","maps","mapdata","qss","tidycensus", "dataverse",
# Analysis
"DeclareDesign", "zoo"
)
# Define function to load packages
ipak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
ipak(the_packages)
kableExtra DT texreg htmltools flair
TRUE TRUE TRUE TRUE TRUE
tidyverse lubridate forcats haven labelled
TRUE TRUE TRUE TRUE TRUE
ggmap ggrepel ggridges ggthemes ggpubr
TRUE TRUE TRUE TRUE TRUE
GGally scales dagitty ggdag ggforce
TRUE TRUE TRUE TRUE TRUE
COVID19 maps mapdata qss tidycensus
TRUE TRUE TRUE TRUE TRUE
dataverse DeclareDesign zoo
TRUE TRUE TRUE
Next we’ll load the recoded data for the lab by sourcing a script called drww_english_recode.R to download the raw data and recode the Cyrillic into English.
If you were to click on the link (you don’t need to, but you might be interested), you’ll see the source code, which:
Downloads from GitHub the raw data from the DRWW project and loads it into R using the foreign package’s read.spss() function because the data are saved as a .sav file.
Creates df_drww from raw_drww, giving the columns (variables) in raw_drww English translations in df_drww
Recodes the values of variable in df_drww into numeric (_n suffixes) and factor (_f suffixes) variables (with levels that are in English)
Constructs some additional variables, like total_social_media_use, which is the total number of social media sites a respondent reports using.
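To make the _n/_f suffix convention concrete, here is an illustrative toy recode (this is not the actual gist code; the column names are invented):

```r
# Toy raw data with Cyrillic responses: "Да" = Yes, "Нет" = No
raw_toy <- data.frame(ans = c("Да", "Нет", NA))

df_toy <- raw_toy
# Factor version (_f suffix), with English levels
df_toy$support_f <- factor(ifelse(df_toy$ans == "Да", "Yes", "No"),
                           levels = c("No", "Yes"))
# Numeric version (_n-style suffix): 1 = yes, 0 = no, NA stays NA
df_toy$support_n <- ifelse(df_toy$ans == "Да", 1, 0)
df_toy
```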
source("https://gist.githubusercontent.com/PaulTestaBrown/7565987b8eedc743fa5a57e451abed40/raw/601680d7b20cd1638d93c325156d4b000ea1f9cc/drww_english_recode.R")
# If this doesn't work you can download the data here:
# load(url("https://pols1600.paultesta.org/files/data/df_drww.rda"))
As always, it’s important to get a high-level overview of the data when we first load it into R.
Below we take a look at the first few values of all the data. You’ll see that df_drww includes both the Russian data and recoded revisions of the data (which are typically appended with _n for numeric data or _f for factor data).
glimpse(df_drww)
Rows: 1,807
Columns: 42
$ sex <chr> "Male", "Male", "Female", "Female", "Male…
$ age <dbl> 99, 78, 73, 73, 69, 69, 59, 54, 49, 48, 4…
$ support_war <fct> "(НЕ ЗАЧИТЫВАТЬ) Затрудняюсь ответить", "…
$ trust_gov <fct> (НЕ ЗАЧИТЫВАТЬ) Затрудняюсь ответить, Дов…
$ employ_working <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,…
$ employ_student <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ employ_retired <dbl> 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,…
$ employ_maternity_leave <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ employ_homemaker <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ employ_unemployed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,…
$ employ_other_employment <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ other_employ_open_response <chr> " …
$ education <fct> (НЕ ЗАЧИТЫВАТЬ) Затрудняюсь ответить, Выс…
$ social_classmates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_in_contact_with <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_facebook <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_instagram <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_twitter <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_telegram <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_whatsapp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_viber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_tiktok <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,…
$ social_other_social_networks <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ none_social <fct> НЕ ВЫБРАН, ВЫБРАН, НЕ ВЫБРАН, ВЫБРАН, НЕ …
$ dk_social <fct> ВЫБРАН, НЕ ВЫБРАН, НЕ ВЫБРАН, НЕ ВЫБРАН, …
$ other_social_open_response <chr> " …
$ other_social_group <fct> NA, NA, Ютуб, NA, Ютуб, NA, NA, NA, NA, N…
$ geo_urban_rural <fct> NA, NA, "город, поселок городского типа",…
$ geo_district <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ weight <dbl> 0.000, 0.000, 1.208, 0.000, 1.231, 1.208,…
$ support_war_f <fct> NA, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes…
$ support_war01 <dbl> NA, 1, 1, 1, 1, 1, 0, 1, 1, NA, 1, 1, 1, …
$ trust_gov_f <fct> NA, Partly, Largely, Fully, Partly, Fully…
$ trust_gov_n <dbl> NA, 1, 2, 3, 1, 3, 0, NA, 2, NA, NA, 0, 3…
$ education_f <fct> NA, College or some college, Vocational, …
$ education_n <dbl> NA, 4, 3, 4, 1, 3, 4, NA, 4, NA, 3, 3, 3,…
$ geo_urban_rural_f <fct> NA, NA, Urban, NA, Rural, Urban, Rural, N…
$ geo_district_f <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ social_youtube <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ social_yandex <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ total_social_media_use <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 9, 0, 1, 0, 1, 0,…
$ no_social_media <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,…
In the first part of this lab, we’ll work with the following variables:
support_war01 “Please tell me, do you support or do not support Russia’s military actions on the territory of Ukraine?” (1=yes, 0 = no)
age “How old are you?”
sex “Gender of respondent” (As assessed by the interviewer)
education_n “What is your highest level of education (confirmed by a diploma, certificate)?” (1 = Primary School, 2 = High School, 3 = Vocational School, 4 = College, 5 = Graduate Degree)3
In the code chunk below, I create a data frame of summary statistics:
the_vars <- c("support_war01","age", "is_female", "education_n")
df_drww %>%
mutate(
is_female = ifelse(sex == "Female",1, 0)
) %>%
select(all_of(the_vars)) %>%
pivot_longer(
cols = all_of(the_vars ),
names_to = "Variable",
values_to = "Value"
) %>%
group_by(Variable) %>%
summarise(
`N obs` = n(),
Missing = sum(is.na(Value)),
Min = min(Value, na.rm = T),
`25th perc` = quantile(Value, .25, na.rm=T),
Mean = mean(Value, na.rm=T),
Median = median(Value, na.rm = T),
`75th perc` = quantile(Value, .75, na.rm=T),
Max = max(Value, na.rm = T)
) %>%
mutate(
Variable = case_when(
Variable == "age" ~ "Age",
Variable == "education_n" ~ "Education",
Variable == "is_female" ~ "Female",
Variable == "support_war01" ~ "Support for War",
),
Variable = factor(Variable, levels = c("Support for War","Age","Female","Education"))
) %>%
arrange(Variable) -> summary_table
summary_table
# A tibble: 4 × 9
Variable `N obs` Missing Min `25th perc` Mean Median `75th perc` Max
<fct> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Support for… 1807 334 0 0 0.720 1 1 1
2 Age 1807 0 18 34 46.6 45 60 99
3 Female 1807 0 0 0 0.470 0 1 1
4 Education 1807 13 1 3 3.17 3 4 5
Which we can then format into a table of summary statistics:
kable(summary_table,
digits = 2) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
kableExtra::pack_rows(
group_label = "Outcome",
start_row = 1,
end_row = 1
)%>%
kableExtra::pack_rows(
group_label = "Predictors",
start_row = 2,
end_row = 4
)
| Variable | N obs | Missing | Min | 25th perc | Mean | Median | 75th perc | Max |
|---|---|---|---|---|---|---|---|---|
| Outcome | ||||||||
| Support for War | 1807 | 334 | 0 | 0 | 0.72 | 1 | 1 | 1 |
| Predictors | ||||||||
| Age | 1807 | 0 | 18 | 34 | 46.65 | 45 | 60 | 99 |
| Female | 1807 | 0 | 0 | 0 | 0.47 | 0 | 1 | 1 |
| Education | 1807 | 13 | 1 | 3 | 3.17 | 3 | 4 | 5 |
Please use this table to describe a typical respondent to the survey.
YOUR DESCRIPTION HERE
The previous section contained a fairly long chunk of code.
I suspect you had little trouble interpreting the table of summary statistics produced by the second code chunk (summary_table), but may not have followed everything that was happening in the first code chunk (summary_stats).
In this section, I want you to practice interpreting large chunks of code with lots of %>%s.
Recall that the pipe operator takes the output of the function that comes before the %>% and pipes it into the first argument of the function that comes after it.
Each function in summary_stats expects a data.frame as its first argument. The function then does something to this input (adds a column, pivots columns, etc.) and returns a new data.frame, which %>% then passes on to the next function.
To learn how to produce a table of summary statistics (a common task for empirical work), it’s helpful to understand what’s happening at each step of the process in the summary_stats code chunk.
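Here is a minimal pipe example on R’s built-in mtcars data (unrelated to the survey), showing how each %>% passes a data frame along; dplyr is part of the tidyverse loaded above:

```r
library(dplyr)

# Each step takes the data frame from the left of %>% as its first argument
res <- mtcars %>%
  mutate(heavy = ifelse(wt > 3, 1, 0)) %>%  # add a column
  group_by(heavy) %>%                       # group the rows
  summarise(mean_mpg = mean(mpg))           # collapse to one row per group
res
```

This is equivalent to the much harder-to-read nested call summarise(group_by(mutate(mtcars, heavy = ifelse(wt > 3, 1, 0)), heavy), mean_mpg = mean(mpg)).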
In the code chunk below, starting from df_drww %>%, please:
Copy and paste all of the code up until the next %>%
Run this code in your console
Write a comment above this code in the code chunk describing what it does
Running code line by line is a great way to learn how to code, by understanding what happens in each line.
There are about 8 or 9 steps of code to explain (depending on how you copy and paste.) I’ve done steps 1-3 below. Please take the next 10 minutes or so to fill in the rest.
# ---- Step 1 -----
# Create a vector of variables we want to summarize
the_vars <- c("support_war01","age", "is_female", "education_n")
# ---- Step 2 -----
# From df_drww, create a new variable `is_female` that is 1 when sex == "Female", and 0 otherwise
df_drww %>%
mutate(
is_female = ifelse(sex == "Female",1, 0)
)
# ---- Step 3 -----
# Select just the columns we want to summarize
df_drww %>%
mutate(
is_female = ifelse(sex == "Female",1, 0)
) %>%
select(all_of(the_vars))
# ---- Step 4 -----
# Comment
# ---- Step 5 -----
# Comment
# ---- Step 6 -----
# Comment
# ---- Step 7 -----
# Comment
# ---- Step 8 -----
# Comment
# ---- Step 9 -----
# Comment
Today, we will estimate four models to explore how Russian support for the war in Ukraine varies with demographic predictors.
m1 provides a baseline model that predicts support for the war as a linear function of age, sex, and education_n, measured as a numeric scale (1 = Primary School, 5 = Graduate Degree).
m2 is an interaction regression model that asks whether the relationship between age and support varies by sex
m3 is a polynomial regression model that includes a term for “age-squared,” which allows the relationship between age and support to vary over different levels of age
m4 adds an interaction term to the polynomial regression from m3, essentially allowing for separate curves for male and female respondents.
Before you estimate these models, please answer the following:
In the baseline model, m1 what do you expect the sign of the coefficient for each predictor to be:
Age (Positive/Negative)
Sex (Positive/Negative)4
Education (Positive/Negative)
In the interaction model, m2
Do you think the relationship between age and support will vary by sex (Yes/No)
If you said yes, do you think the coefficient on the interaction between age and sex will be positive or negative (Positive/Negative)
In the polynomial model, m3
If the coefficient on age is positive and the coefficient on age^2 is positive, then as age increases, the increase in the predicted level of support will be (increasing/decreasing)
If the coefficient on age is positive and the coefficient on age^2 is negative, then as age increases, the increase in the predicted level of support will be (increasing/decreasing)
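To see the arithmetic behind these questions, here is a toy calculation with hypothetical (made-up) coefficients, b1 > 0 on age and b2 < 0 on age^2:

```r
# Hypothetical coefficients, chosen purely for illustration
b1 <- 0.02
b2 <- -0.0002
pred <- function(age) b1 * age + b2 * age^2

# The same 10-year increase in age produces a smaller increase at older ages:
pred(30) - pred(20)  # 0.1
pred(50) - pred(40)  # 0.02
```

With a negative coefficient on age^2, predicted support still rises with age, but by less and less as age increases; flip the sign of b2 and the increases would grow instead.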
In m4
If the interaction between the sex variable and the age variables (age and age^2) is statistically significant, this implies that the relationship between age and support for the war is (similar/different) for male and female respondents.
Uncomment the code below, and replace the ??? with the appropriate terms to fit the following models.
# # Baseline Model
# m1 <- lm(support_war01 ~ age + sex + education_n, df_drww)
#
# # Interaction model: Allow coefficient for age to vary with sex
# m2 <- lm(support_war01 ~ age*??? + education_n, df_drww)
#
# # Polynomial model: Allow coefficient for age to vary by age
# m3 <- lm(support_war01 ~ age + I(???^2) + sex + education_n, df_drww)
#
# # Separate Polynomial: Allow coefficient for age to vary by age separately by sex
# m4 <- lm(support_war01 ~ (age + I(???))*??? + education_n, df_drww)
Using code from previous labs and lectures, please display the models from the previous section in a regression table and offer some initial interpretations.
You can use the code from last week’s lab as a guide.
Hint: you may want to pass digits = 4 to htmlreg().
# Display m1, m2, m3, m4 in a regression table
In particular, please answer the following:
For m1 (Model 1), how do predicted levels of support for the war change with:
Age (Write a sentence or two here describing the relationship in terms a human can understand.)
Sex
Education
For m2 (Model 2), does the relationship between age and support appear to differ for male and female respondents (Yes/No)?
For m3 (Model 3), does the relationship between age and support appear to be constant, or does the predicted marginal effect of an increase in age differ?5
For m4 (Model 4) do the varying marginal effects of age appear to also vary by gender?6
Now take a look at the bottom rows in your regression table which show each model’s \(R^2\) and adjusted \(R^2\).
Recall from class that \(R^2\) describes the proportion of the variance in our outcome (support for the war) explained by the predictors in our model (age, sex, education, and some interactions/polynomials).
You may also remember that \(R^2\) always increases as we add more predictors to the model. To account for this, we often look at the adjusted \(R^2\) which weights the increase in variance explained (decrease in variance unexplained) by the number of additional predictors needed to produce that increase.
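For reference, adjusted \(R^2\) uses the standard formula \(1 - (1 - R^2)\frac{n-1}{n-k-1}\), where \(n\) is the number of observations and \(k\) the number of predictors. A quick check on toy (made-up) data:

```r
# Toy model on made-up data
set.seed(3)
d <- data.frame(x = rnorm(50))
d$y <- d$x + rnorm(50)
m <- lm(y ~ x, data = d)

r2 <- summary(m)$r.squared
n <- nrow(d)
k <- 1  # number of predictors

# Adjusted R^2 by hand; should match what summary() reports
adj_by_hand <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
all.equal(adj_by_hand, summary(m)$adj.r.squared)
```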
Your regression table only reports results to a few decimal places.
Let’s use the summary() function to extract and save the \(R^2\) (r.squared) and adjusted \(R^2\) (adj.r.squared) for each model. The code for m1 is there as an example.
# m1
m1_r_squared <- summary(m1)$r.squared
Error in summary(m1): object 'm1' not found
m1_r_squared
Error in eval(expr, envir, enclos): object 'm1_r_squared' not found
m1_adj_r_squared <- summary(m1)$adj.r.squared
Error in summary(m1): object 'm1' not found
m1_adj_r_squared
Error in eval(expr, envir, enclos): object 'm1_adj_r_squared' not found
# m2
# m3
# m4
Please answer the following:
How does the \(R^2\) change from m1 to m4?
How does the adjusted \(R^2\) change from m1 to m4?
Overall, which model should we prefer?
Now let’s produce and visualize predicted values to help us interpret the relationship between age and support for the war in m1 (our baseline model) and m3 (our polynomial model).
We demonstrated the steps for producing predicted values in Tuesday’s lecture.
To review, you’ll need to:
Create a prediction data frame (pred_df) using expand_grid()
Vary age using seq()
Hold sex and education constant at typical values
Use pred_df to obtain predicted values from your model, saving them back into pred_df: pred_df$fit_m1 <- predict(m1, newdata = pred_df)
Visualize the results with ggplot, using pred_df as your data
Map age to the x axis
Map fit_m1 or fit_m3 to the y axis
Add geom_line()
# 2. Create a prediction data frame (`pred_df`)
# 3 Use this prediction data frame to obtain predicted values from our model
# Save the output from predict() back into our prediction data frame
## predicted values for m1
## predicted values for m3
# 4. Visualize the results using ggplot.
## plot predicted values for m1
## plot predicted values for m3
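If you’re stuck, here is the same workflow sketched on R’s built-in mtcars data (a toy stand-in for the lab’s models, with wt playing the role of age):

```r
library(tidyr)    # expand_grid(); part of the tidyverse loaded above
library(ggplot2)

# Toy polynomial model, standing in for the lab's m1/m3
m_toy <- lm(mpg ~ wt + I(wt^2) + am, data = mtcars)

# 2. Prediction grid: vary wt, hold am constant at a typical value
pred_toy <- expand_grid(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
  am = 0
)

# 3. Obtain predicted values and save them back into the grid
pred_toy$fit <- predict(m_toy, newdata = pred_toy)

# 4. Plot the predictions: a curve, because of the I(wt^2) term
ggplot(pred_toy, aes(x = wt, y = fit)) +
  geom_line()
```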
Please offer a brief interpretation of these two figures
Finally, let’s see how to produce and interpret predictions for models m2 and m4.
Recall that in m2 we allowed the coefficient on age to vary by sex, and in m4 our model estimates separate age curves for male and female respondents.
In the code chunk below
Create a prediction data frame (pred_df_int) using expand_grid()
Vary age using seq()
Set sex = c("Male","Female")
Hold education constant at a typical value
Use pred_df_int to obtain predicted values from your model, saving them back into pred_df_int: pred_df_int$fit_m2 <- predict(m2, newdata = pred_df_int)
Visualize the results with ggplot, using pred_df_int as your data
Map age to the x axis
Map fit_m2 or fit_m4 to the y axis
Map sex to col to produce separate lines by the values of sex
Add geom_line()
# 2. Create a prediction data frame (`pred_df_int`)
# 3. Use this prediction data frame to obtain predicted values from our model
# Save the output from predict() back into our prediction data frame
## predicted values for m2
## predicted values for m4
# 4. Visualize the results using ggplot.
## plot predicted values for m2
## plot predicted values for m4
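Again, if you’re stuck, here is the analogous workflow on mtcars with a toy interaction model, mapping the grouping variable to col so each group gets its own line:

```r
library(tidyr)
library(ggplot2)

# Toy interaction model, standing in for the lab's m2/m4
m_toy_int <- lm(mpg ~ wt * factor(am), data = mtcars)

# Prediction grid: vary wt AND the grouping variable
pred_int_toy <- expand_grid(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
  am = c(0, 1)
)
pred_int_toy$fit <- predict(m_toy_int, newdata = pred_int_toy)

# Mapping the grouping variable to col draws a separate line per group
ggplot(pred_int_toy, aes(x = wt, y = fit, col = factor(am))) +
  geom_line()
```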
Again, please offer a brief interpretation of these two figures.
Finally, if you’re finished with all the sections above, take some time to explore the data:
For fun, let’s say the group that produces the most interesting combination of question/model/interpretation gets some candy next class.
Some questions you might explore:
Please take a few moments to complete the class survey for this week.
This week we’re asking questions you submitted in last week’s survey. To encourage participation, let’s say that if we get a 100 percent completion rate, I will bring donuts to class on Tuesday.
If you really hate these surveys but like donuts, you can just click through without answering any of the questions. As long as we get to 24 responses, donuts for all.
Mathematically, recall that the slope/first derivative of the line \(y = f(x) = 2x\) is constant \((f'(x) = 2)\): if we increase x by 1, we expect y to increase by 2. The derivative of a parabola \(y = f(z) = z^2\), by contrast, varies with \(z\) \((f'(z) = 2z)\). Going from z = 2 to z = 3 is associated with a greater increase in y than going from z = 1 to z = 2. Our model, however, is still linear in the parameters \(\beta\); that is, it is still a linear regression. If our model included some parameter like \(\theta^2\), then it would be a non-linear regression.↩︎
In a machine learning framework, we are trying to find an optimal tradeoff between reducing bias in our predictions by including more predictors and minimizing variance in our predictions by not overfitting the data.↩︎
I think Google Translate’s rendering was a bit unclear, but higher numbers equal more education.↩︎
This is tricky: you need to know what the reference (excluded) category will be.↩︎
Basically, I’m asking whether the coefficient on I(age^2) is statistically significant. If it is, then the change in predicted support for the war among, say, 20-year-olds compared to 30-year-olds would be different than the change between 30- and 40-year-olds. Interpreting polynomial terms and interaction models is much easier if, as we do later, we simply obtain and plot the predicted values from the model.↩︎
As in the previous question, you basically need to look at the table and see if the coefficients on the interaction terms are statistically significant. It’s a little more complicated than this, but if they are significant, this is evidence of differences across sex.↩︎